Engineering posts about Service Discovery
Curated summaries and key learnings for engineers working with Service Discovery.
Rethinking Distributed Systems for Serverless Performance and Reliability
The article explores the evolution of serverless compute for Apache Spark, addressing long-standing architectural challenges that have hindered performance and reliability. It emphasizes the need for...
From SSH to REST: A Security-Driven Modernization of Slack’s EMR Data Pipelines
The article outlines Slack's transition from a legacy SSH-based architecture to a modern REST-based job submission system for its data pipelines. Initially, the reliance on SSH created significant...
Rearchitecting the Workflows control plane for the agentic era
The article discusses the rearchitecting of the Workflows control plane to accommodate a shift towards agent-triggered workflows, necessitated by the increasing demand for durable execution engines...
Building a Distributed Persistent Queue That Scaled AI Workloads 5x Under LLM Rate Limits
The article discusses the engineering of a distributed persistent queue that orchestrates AI workloads and human workflows within strict infrastructure limits. It highlights the challenges of scaling...
Zero-Downtime Patching in Lakebase Part 1: Prewarming
The article discusses the challenges of planned maintenance in database systems, particularly the performance degradation caused by cold restarts. It introduces Lakebase's...
Multi-Cloud Challenges, Intelligent Load Balancing, and AI-Powered Workflows: Databricks at SRECon 2026
The article highlights Databricks' advancements in infrastructure reliability and efficiency as presented at SRECon 2026. It delves into the challenges of multi-cloud operations, particularly...
Scaling Jira Cloud Migrations, One Bottleneck at a Time
The article chronicles the Jira Migrations team's journey in scaling their migration platform from handling 20,000 to 50,000 Monthly Paid Enabled Users (PEUs). It discusses the transition from an...
How Data 360 Optimized Kubernetes Scheduling Architecture, Delivering 13% Cost Savings
The article discusses how the Data 360 Compute Fabric team at Salesforce optimized Kubernetes scheduling to enhance resource efficiency and reduce costs. By evolving the default kube-scheduler...
How we rebuilt the search architecture for high availability in GitHub Enterprise Server
The article discusses the architectural improvements made to the search functionality in GitHub Enterprise Server to enhance high availability (HA). It highlights the transition from a clustered...
Building Prometheus: How Backend Aggregation Enables Gigawatt-Scale AI Clusters
The article discusses the implementation of backend aggregation (BAG) in Meta's Prometheus AI clusters, highlighting its role in interconnecting thousands of GPUs across multiple data centers. BAG...
Welcoming Stately Cloud to Databricks: Investing in the Foundation for Scalable AI Applications
The article highlights Databricks' acquisition of Stately Cloud, emphasizing the importance of building a robust foundation for scalable AI applications. It discusses the expertise of the Stately...